Lab Section

Download auto data from the Statistical Learning book website here: http://www-bcf.usc.edu/~gareth/ISL/data.html

Today, we are going over Hierarchical clustering, K-Means Clustering, PCA, and ICA.

# read in Auto data
Auto_data <- read_csv("~/Desktop/MS/fall/machine learning/assignments/hw1-taitea/Auto.csv")
## Parsed with column specification:
## cols(
##   mpg = col_double(),
##   cylinders = col_double(),
##   displacement = col_double(),
##   horsepower = col_character(),
##   weight = col_double(),
##   acceleration = col_double(),
##   year = col_double(),
##   origin = col_double(),
##   name = col_character()
## )
#remove cars with unknown horsepower and set horsepower to numeric
Auto_data <- Auto_data %>% 
  filter(horsepower != "?") %>% 
  mutate(horsepower = as.numeric(horsepower)) %>% 
  as.data.frame()

#save car names 
Auto_data_names <- Auto_data$name

#data to cluster
Auto_data_clust <- Auto_data[,1:8]
dim(Auto_data_clust)
## [1] 392   8
#392 is too much for a demo, so lets take the first 25
Auto_data_clust <- Auto_data_clust[1:25,]
rownames(Auto_data_clust) <- Auto_data_names[1:25]

Hierarchical agglomerative clustering

Step 1. Assign each item to it’s own cluster. We start with 25 clusters, one for each car.

Step 2. Calculate a proximity matrix between each cluster.

Step 3. Find the pair of clusters closest to each other.

Step 4. Merge these clusters and then recalculate similarity between clusters. Some options are: single linkage (distance is calculated from the nearest neighbors), complete linkage (distance is calculated from furthest neighbor), average linkage (distance is calculated from mean of different clusters).

Step 5. Repeat Step 3 and 4 until there is only one cluster.

In practice

Step 1. Each car is a cluster.

Step 2. Create a distance matrix from Auto_data_clust.

help("dist")
## Help on topic 'dist' was found in the following packages:
## 
##   Package               Library
##   factoextra            /Users/taitea/Library/R/3.6/library
##   stats                 /Library/Frameworks/R.framework/Versions/3.6/Resources/library
## 
## 
## Using the first match ...
hierarchical_dist <- as.matrix(dist(Auto_data_clust, method = "euclidean"))
#View(hierarchical_dist)

Step 3. Find the two cars that are the most similar to each other and print the names of those two cars

diag(hierarchical_dist) <- NA
arrayInd(which.min(hierarchical_dist), dim(hierarchical_dist))
##      [,1] [,2]
## [1,]   23   15
#postitions 23 and 15 are the most similar. Lets go back to the names of the cars
Auto_data_names[23]
## [1] "saab 99e"
Auto_data_names[15]
## [1] "toyota corona mark ii"

Step 4. Merge the two clusters together using average linkage.

#replace pos 15 with the average of pos 15 and 23
hierarchical_dist[,15] <- apply((hierarchical_dist[,c(23,15)]),1,mean)
hierarchical_dist[15,] <- apply((hierarchical_dist[c(23,15),]),2,mean)

#remove pos 23
hierarchical_dist <- hierarchical_dist[-23,-23]

#now position 15 represents the cluster containing the saab99e and the toyota corona mark ii

Step 5. To complete the algorithm, go back to step 3 and iterate through all of the previous steps until there are no more rows left

diag(hierarchical_dist) <- NA
arrayInd(which.min(hierarchical_dist), dim(hierarchical_dist))
##      [,1] [,2]
## [1,]    4    3
#postitions 4 and 3 are the most similar
Auto_data_names[4]
## [1] "amc rebel sst"
Auto_data_names[3]
## [1] "plymouth satellite"

R function

Now that we know how the algorithm works, let’s use the R function hclust. Plot the Dendogram resulting from clustering the Auto_data_clust using average linkage.

hierarchical_dist <- dist(Auto_data_clust, method = "euclidean")
tree <- hclust(hierarchical_dist, method="average")
plot(tree)

There is one more element to hierarchical clustering: Cutting the tree. Here, we can control how many clusters we want or the height of the tree.

#help(cutree)

# cut tree into 3 clusters
tree <- hclust(hierarchical_dist, method="average")
plot(tree)
tree_k2 <- cutree(tree, k = 2)
# plot the tree before running this line 
rect.hclust(tree, k = 3, h = NULL)

Principal Components Analysis (PCA)

Principal Components Analysis is a linear dimensionality reduction algorithm. If you want to learn more about linear algebra, I suggest the MIT Open Courseware class here : https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/ There are two ways of doing PCA, Single Value Decomposition (SVD), and the method we will use today, using the covariance matrix of the data.

Step 1. Center data by subtracting the mean.

Step 2. Calculate covariance matrix of data.

Step 3. Perform Eigendecomposition of the covariance matrix. i.e. represent the matrix in terms of it’s eigenvalues and eigen vectors

Step 4. Multiply the eigen vectors by the original data to express the data in terms of the eigen vectors.

Step 1. Center the data by subtracting the mean of the each column from the values in that column

Auto_data_clust_pca <- data.matrix(Auto_data_clust)

Center_auto <- apply(Auto_data_clust_pca, 2, function(x) x - mean(x))

Step 2. Calculate covariance matrix of the Auto data

Covariance_auto <- cov(Center_auto)

Step 3. Calculate eigen values and vectors

Eigen_value_auto <- eigen(Covariance_auto)$value

#columns are the eigen vectors
Eigen_vector_auto <- eigen(Covariance_auto)$vector

Step 4. Multiply the eigen vector matrix by the original data.

PC <- as.data.frame(data.matrix(Center_auto) %*% Eigen_vector_auto)

ggplot(PC, aes(PC[,1], PC[,2])) + geom_point(aes(PC[,1], PC[,2])) 

#+ geom_text(aes(label=Auto_data_names[1:8]), nudge_x = -2.5, nudge_y = 400)

Step 5. Find out which principal components explain the variance in the data.

#for each component, take the cumulative sum of eigen values up to that point and and divide by the total sum of eigen values
round(cumsum(Eigen_value_auto)/sum(Eigen_value_auto) * 100, digits = 2)
## [1]  99.52  99.95 100.00 100.00 100.00 100.00 100.00 100.00

Principal component 1 and 2 explain 99.99 percent of the variance. Principal component 1,2, and 3 together explain 100% of the variance in the data.

R function

Now that we know how PCA works, lets use the R funtion prcomp.

help("prcomp")
autoplot(prcomp(Auto_data_clust_pca))

Independent Component Analysis (ICA)

ICA is an algorithm that finds components that are independent, subcomponents of the data.

Step 1. Whiten the data by projecting the data onto the eigen vectors (PCA).

Step 2. Solve the X=AS equation by maximizing non-gaussianty in the variables(components) in S.

This results in a matrix S with components that are independent from each other.

We will use the fastICA algorithm.

First we will go backwards. Create a matrix S with the independent components

#create two signals
S <- cbind(cos((1:500)/10), ((500:1)/1000))

par(mfcol = c(1, 2))
plot(S[,1], type="l")
plot(S[,2], type="l")

Create a mixing matrix A

A <- matrix(c(0.5, 0.7, 0.423, 0.857), 2, 2)

Mix S using A

X <- S %*% A
par(mfcol = c(1, 2))
plot(X[,1], type="l")
plot(X[,2], type="l")

Unmix using fastICA

par(mfcol = c(1, 2))
plot(1:500, a$S[,1], type = "l", xlab = "S'1", ylab = "")
plot(1:500, a$S[,2], type = "l", xlab = "S'2", ylab = "")

ICA on the auto data

plot the independent components as a heatmap

heatmap(a$S)

Homework

data(iris)
  1. Subset the Iris dataset to only include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
#subsetting by removing the extra variable and saving as a new dataset
iris_mod <- iris[c(-5)]

#saving species names
species_names <- iris$Species
  1. Write out the Kmeans algorithm by hand, and run two iterations of it.
#converting data from list to matrix
iris_mod_matrix=as.matrix(iris_mod)
centers <- iris_mod_matrix[sample(nrow(iris_mod_matrix), 3),]

#defining euclidean distance to be used in kmeans calculation
euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}

#defining kmeans clustering as a function
K_means <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")

  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }

  list(clusters=clusterHistory, centers=centerHistory)
}

#calling function for 2 iterations on data matrix and printing results
iris_cluster <- K_means(iris_mod_matrix, centers, euclid, 2)
iris_cluster
## $clusters
## $clusters[[1]]
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 1 3 1 3 1 3 1 1 1 1 1 1 3 1 1 1 1
##  [71] 3 1 3 1 1 3 3 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
## [106] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
## 
## $clusters[[2]]
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 3 1 3 1 3 1 3 1 1 1 1 1 1 3 1 1 1 1
##  [71] 3 1 3 1 1 1 3 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
## [106] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
## 
## 
## $centers
## $centers[[1]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.681579    2.681579     4.113158    1.286842
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.617742    2.988710     5.391935    1.914516
## 
## $centers[[2]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.729268    2.690244     4.151220    1.300000
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.632203    2.998305     5.430508    1.937288
  1. Run PCA on the Iris dataset. Plot a scatter plot of PC1 vs PC2 and include the percent variance those PCs describe.
#Step 1. Center data by subtracting the mean.
iris_data_clust_pca <- data.matrix(iris_mod)
center_iris <- apply(iris_data_clust_pca, 2, function(x) x - mean(x))

#Step 2. Calculate covariance matrix of data.
covariance_iris <- cov(center_iris)

#Step 3. Perform Eigendecomposition of the covariance matrix
eigen_value_iris <- eigen(covariance_iris)$value
eigen_vector_iris <- eigen(covariance_iris)$vector

#Step 4. Multiply the eigen vectors by the original data to express the data in terms of the eigen vectors.
PC <- as.data.frame(data.matrix(center_iris) %*% eigen_vector_iris)

#Step 5. Find out which principal components explain the variance in the data.
#plot PC1 and PC2 in scatterplot
ggplot(PC, aes(PC[,1], PC[,2])) + geom_point(aes(PC[,1], PC[,2]))

#calculate variance explained
round(cumsum(eigen_value_iris)/sum(eigen_value_iris) * 100, digits = 2)
## [1]  92.46  97.77  99.48 100.00

PC1 explains 92.46 percent of the variance. PC1 and PC2 explain 97.77 percent of the variance.

  1. Run ICA on the Iris dataset. Plot the independent components as a heatmap.
#create a matrix S with the independent components
S <- cbind(cos((1:500)/10), ((500:1)/1000))

par(mfcol = c(1, 2))
plot(S[,1], type="l")
plot(S[,2], type="l")

#create a mixing matrix A
A <- matrix(c(0.5, 0.7, 0.423, 0.857), 2, 2)

#mix S using A
X <- S %*% A
par(mfcol = c(1, 2))
plot(X[,1], type="l")
plot(X[,2], type="l")

#unmix using FastICA
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "R", row.norm = FALSE, maxit = 200,
             tol = 0.0001, verbose = TRUE)

#ICA on the iris data
iris_ica <- fastICA(iris_mod, 7, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "R", row.norm = FALSE, maxit = 200,
             tol = 0.0001, verbose = TRUE)

#heatmap of iris ICA
heatmap(iris_ica$S)

  1. Use Kmeans to cluster the Iris data.
  • Use the silhouette function in the cluster package to find the optimal number of clusters for kmeans for the iris dataset. Then cluster using kmeans clustering. Does the data cluster by species?
  • Using this clustering, color the PCA plot according to the clusters.
library(cluster)
library(purrr)

#generating silhouette plot for a variety of cluster numbers
iris_k2 <- pam(iris, k=2)
plot(silhouette(iris_k2))

iris_k3 <- pam(iris, k=3)
plot(silhouette(iris_k3))

iris_k4 <- pam(iris, k=4)
plot(silhouette(iris_k4))

#calculating highest average silhouette width
sil_width <- map_dbl(2:10, function(k) {model <- pam(x = iris, k = k) 
  model$silinfo$avg.width})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
print(sil_df)
##    k sil_width
## 1  2 0.6825351
## 2  3 0.5728779
## 3  4 0.5036421
## 4  5 0.4948241
## 5  6 0.5074318
## 6  7 0.3725349
## 7  8 0.3440288
## 8  9 0.3040292
## 9 10 0.3052771
#kmeans clustering
set.seed(20)
clusters <- kmeans(iris_mod, 2)
iris$clusters <- as.factor(clusters$cluster)
str(clusters)
## List of 9
##  $ cluster     : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
##  $ centers     : num [1:2, 1:4] 5.01 6.3 3.37 2.89 1.56 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ totss       : num 681
##  $ withinss    : num [1:2] 28.6 123.8
##  $ tot.withinss: num 152
##  $ betweenss   : num 529
##  $ size        : int [1:2] 53 97
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
#PCA plot by cluster
set.seed(20)
autoplot(kmeans(iris_mod, 2), data = iris_mod)

The silhouette analysis indicated that the optimal number of clusters is 2, but the dataset includes 3 species, so it does not cluster by species.

  1. Use hierarchical clustering to cluster the Iris data.
  • Try two different linkage types, and two different distance metrics.
  • For one linkage type and one distance metric, try two different cut points.
  • Using this clustering, color the PCA plot according to the clusters. (6 plots in total)
#hierarchical clustering with average linkage and euclidean distance metric
iris_hierarchical_dist_eu <- dist(iris_mod, method = "euclidean")
tree_1 <- hclust(iris_hierarchical_dist_eu, method="average")
plot(tree_1)

#hierarchical clustering with average linkage and maximum distance metric
iris_hierarchical_dist_max <- dist(iris_mod, method = "maximum")
tree_2 <- hclust(iris_hierarchical_dist_max, method="average")
plot(tree_2)

#hierarchical clustering with single linkage and euclidean distance metric
tree_3 <- hclust(iris_hierarchical_dist_eu, method="single")
plot(tree_3)

#hierarchical clustering with single linkage and maximum distance metric
tree_4 <- hclust(iris_hierarchical_dist_max, method="single")
plot(tree_4)

#hierarchical clustering with average linkage and euclidean distance metric with 2 cut points (3 and 4)
tree_5 <- hclust(iris_hierarchical_dist_eu, method="average")
plot(tree_5)
tree_k3 <- cutree(tree_5, k = 3)
rect.hclust(tree_5, k = 3, h = NULL)

tree_6 <- hclust(iris_hierarchical_dist_eu, method="average")
plot(tree_6)
tree_k4 <- cutree(tree_6, k = 4)
rect.hclust(tree_6, k = 4, h = NULL)

#plots of PCA colored by hierarchical clusters computed above
iris_tree_1 <- hcut(iris_hierarchical_dist_eu, hc_method = "average")
fviz_cluster(iris_tree_1, ellipse.type = "convex", data = iris_mod)

iris_tree_2 <- hcut(iris_hierarchical_dist_max, hc_method = "average")
fviz_cluster(iris_tree_2, ellipse.type = "convex", data = iris_mod)

iris_tree_3 <- hcut(iris_hierarchical_dist_eu, hc_method = "single")
fviz_cluster(iris_tree_3, ellipse.type = "convex", data = iris_mod)

iris_tree_4 <- hcut(iris_hierarchical_dist_max, hc_method = "single")
fviz_cluster(iris_tree_4, ellipse.type = "convex", data = iris_mod)

iris_tree_5 <- hcut(iris_hierarchical_dist_eu, k = 3, hc_method = "average")
fviz_cluster(iris_tree_5, ellipse.type = "convex", data = iris_mod)

iris_tree_6 <- hcut(iris_hierarchical_dist_eu, k = 4, hc_method = "average")
fviz_cluster(iris_tree_6, ellipse.type = "convex", data = iris_mod)

Optional material

On PCA:

Eigen Vectors and Eigen Values http://www.visiondummy.com/2014/03/eigenvalues-eigenvectors/ Linear Algebra by Prof. Gilbert Strang https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/ http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

On ICA:

Independent Component Analysis: Algorithms and Applications https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf Tutorial on ICA taken from http://rstudio-pubs-static.s3.amazonaws.com/93614_be30df613b2a4707b3e5a1a62f631d19.html